Discovering Characteristic Expressions from Literary Works: A New Text Analysis Method beyond N-Gram Statistics and KWIC
Abstract
We attempt to extract characteristic expressions from literary works. The problem is as follows: given literary works by a particular writer as positive examples and works by another writer as negative examples, find expressions that appear frequently in the positive examples but not in the negative examples. This can be regarded as a special case of optimal pattern discovery from textual data in which only substring patterns are considered. One reasonable approach is to create a list of substrings arranged in descending order of their goodness and to have a human expert examine the top of the list. Since Japanese texts have no word boundaries, a substring is often a fragment of a word or a phrase, so assisting the human expert is key to successful discovery. In this paper, we propose (1) restricting the list to prime substrings in order to remove redundancy, and (2) a way of browsing the neighborhood of a focused string as well as its context. We report successful results from applying this method to two pairs of anthologies of classical Japanese poems. We expect that the extracted expressions may lead to the discovery of overlooked aspects of individual poets.
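The ranking step described in the abstract can be illustrated with a minimal Python sketch. It makes several assumptions that do not come from the paper: the goodness score is taken to be the fraction of positive texts containing a substring minus the fraction of negative texts containing it, substrings are capped at a hypothetical `max_len`, and the function names `substrings` and `rank_substrings` are invented for illustration. The sketch does not implement the restriction to prime substrings or the browsing interface.

```python
from collections import Counter


def substrings(text, max_len):
    """Return the set of all substrings of `text` with length 1..max_len."""
    return {
        text[i:i + n]
        for n in range(1, max_len + 1)
        for i in range(len(text) - n + 1)
    }


def rank_substrings(positive, negative, max_len=4):
    """Rank substrings of the positive texts by an assumed goodness score.

    Assumption (not from the paper): goodness is the fraction of positive
    texts containing the substring minus the fraction of negative texts
    containing it.
    """
    # Count, for each substring, how many texts in each set contain it.
    pos_df = Counter(s for t in positive for s in substrings(t, max_len))
    neg_df = Counter(s for t in negative for s in substrings(t, max_len))
    scored = [
        (pos_df[s] / len(positive) - neg_df[s] / len(negative), s)
        for s in pos_df
    ]
    # Best score first; ties go to the longer substring, then alphabetical.
    scored.sort(key=lambda pair: (-pair[0], -len(pair[1]), pair[1]))
    return scored


if __name__ == "__main__":
    # Toy strings standing in for poems; real input would be the anthologies.
    positive = ["abcx", "abcy"]
    negative = ["xyz", "bcz"]
    for score, s in rank_substrings(positive, negative, max_len=3)[:5]:
        print(f"{score:+.2f}  {s}")
```

Even on toy input the top of the list is crowded with nested fragments of the same expression, which is the redundancy that the restriction to prime substrings is meant to remove.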
Similar resources
Temporal Analysis of Literary and Programming Prose
Literary works reference a variety of globally shared themes including well-known people, events, and time periods. It is particularly interesting to locate patterns that are either invariant across time or exhibit a characteristic change across time, as they could imply something important about society that those works record. This paper suggests the use of Google n-gram viewer as a fast prot...
Automatically Assessing Whether a Text Is Cliched, with Applications to Literary Analysis
Clichés, as trite expressions, are predominantly multiword expressions, but not all MWEs are clichés. We conduct a preliminary examination of the problem of determining how clichéd a text is, taken as a whole, by comparing it to a reference text with respect to the proportion of more-frequent n-grams, as measured in an external corpus. We find that more-frequent n-grams are over-represented in ...
A deconstructive critique of a mystical anecdote from the book Ronaq al-Majalis [The Prosperity of Meetings]
Deconstruction was first introduced in the thought of Jacques Derrida as a way of re-reading texts and questioning their presuppositions. This type of critique seeks new meanings, on the one hand, by finding binary oppositions in the text and disrupting the superiority and domination of one side over the other and, on the other hand, by discovering gaps and discontinuities that have arisen in the text...
Author gender identification from text using Bayesian Random Forest
Nowadays, the heavy use of virtual environments and the connections users form through social networks such as Facebook, Instagram, and Twitter make it more necessary than ever to identify the subjects shared in these environments. Several applications benefit from reliable methods for inferring the age and gender of social media users. Such applications exist across a wide area of fields,...
The Relationship between Stylish Characteristics of Wittgenstein’s Writing and his Philosophizing
The style and form of Wittgenstein’s writing are unique in contemporary philosophical literature. Thinkers take different stands on the role of style and form in Wittgenstein’s works. Some consider form and style to be aesthetic elements that weaken the argumentation of his texts. Others, in contrast, believe that he was aware of diverse options of form and style and chosen his...